feat(core): bump DataFusion to 49.0.1#1293
Conversation
WalkthroughAdapts code to upstream DataFusion API/AST changes (Literal/ValueWithSpan/ObjectNamePart/FieldRef/ScalarFunctionArgs), preserves spans/metadata through logical-plan rewrites, adjusts MDL analyzer/optimizer rule pipelines and dialect translation, bumps workspace dependencies, and updates tests and sqllogictest expectations. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant Frontend as SQL Frontend
participant MDL as MDL Context
participant Analyzer as Analyzer Pipeline
participant Optimizer as Optimizer Pipeline
participant DF as DataFusion AST
User->>Frontend: submit SQL / request transformation
Frontend->>MDL: build LogicalPlan (updated AST shapes)
MDL->>Analyzer: analyze_rule_for_*()
note right of Analyzer: analyzer pipeline changed (removed rules)
Analyzer-->>MDL: analyzed LogicalPlan (spans/metadata preserved)
MDL->>Optimizer: optimize_rule_for_unparsing()
note right of Optimizer: optimizer pipeline changed (removed rules)
Optimizer-->>MDL: optimized LogicalPlan
MDL->>DF: translate to SQL/AST (uses ObjectNamePart, FieldRef, ValueWithSpan)
DF-->>MDL: SQL/AST nodes
MDL-->>Frontend: final SQL/plan
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (13)
wren-core/core/src/mdl/dialect/inner_dialect.rs (3)
185-188: Good adaptation to Expr::Literal(_, _) + minor simplificationMatching the two-field Literal for Utf8/LargeUtf8 is correct for DF 49.x. You can simplify the return to avoid wrapping an Ok around a
?.- | Expr::Literal(ScalarValue::LargeUtf8(Some(s)), _) => { - Ok(self.datetime_field_from_str(s)?) - } + | Expr::Literal(ScalarValue::LargeUtf8(Some(s)), _) => { + self.datetime_field_from_str(s) + }
195-231: Consider stricter validation for WEEK(...) parsingThe WEEK(…) parser accepts any trailing characters after the closing ')' and doesn’t validate unmatched parentheses robustly. Not a blocker, but it can reduce surprising inputs.
- Ensure
end == s.len() - 1after finding ')'.- Validate that
start < endand there’s no extra content outside WEEK(…).
251-254: Typo in local variable name
uppsercaseshould beuppercase. Purely cosmetic.-fn non_uppercase(sql: &str) -> bool { - let uppsercase = sql.to_uppercase(); - uppsercase != sql -} +fn non_uppercase(sql: &str) -> bool { + let uppercase = sql.to_uppercase(); + uppercase != sql +}wren-core/core/src/logical_plan/utils.rs (2)
50-52: List child field aligns with DF 49; add a short noteSwitching to
Field::new("item", data_type, true)matches Arrow/DF conventions (child name “item”, elements nullable). Good change. Consider a short doc comment here to avoid future regressions.
169-171: Potentially invalid Arrow time type mappingMapping
"time"toDataType::Time32(TimeUnit::Nanosecond)is unconventional. In Arrow, nanoseconds typically useTime64(Nanosecond);Time32is generally for Second/Millisecond. Not related to this PR, but worth correcting to avoid downstream kernel incompatibilities and surprising behavior in parquet/IPC.- "time" => DataType::Time32(TimeUnit::Nanosecond), // chose the smallest time unit + "time" => DataType::Time64(TimeUnit::Nanosecond),If you prefer 32-bit, use
SecondorMillisecond.wren-core/core/src/mdl/dialect/utils.rs (1)
58-63: Correct ObjectName construction under new ASTWrapping Ident in
ObjectNamePart::Identifieris the right adjustment for DF 49.0.1. One minor improvement: if you have spans available upstream, consider threading them instead ofSpan::empty()to preserve diagnostics.wren-core/sqllogictest/test_files/view.slt (1)
19-23: Don’t silently comment out failing logic tests—gate or trackUnderstandable to disable while fixing push_down_filter on views. Please either:
- Add a tracking issue and reference it here, or
- Use a harness-supported skip/ignore mechanism (if available) instead of comments, so the test remains visible in reports.
I can help convert this to a gated/ignored test if the framework supports it.
wren-core/core/src/logical_plan/optimize/simplify_timestamp.rs (1)
146-154: Literal arm updated; consider adding tests for both Cast and TryCastUpdating to
Expr::Literal(value, _)is correct. Consider adding unit tests that cover:
- Cast/TryCast of timestamp literals across all four units
- NamePreserver preserving aliases through simplification
I can scaffold tests under core/src/logical_plan/optimize/tests if helpful.
wren-core-py/src/context.rs (1)
251-252: Consider preserving numeric formatting semantics for newly introduced LIMIT.You default the is flag to false for new limits. If your AST uses this flag to track formatting (e.g., integer vs decimal/long), consider setting it to a sensible default (likely false is fine) or making it configurable if downstream expects a specific representation.
wren-core/core/src/logical_plan/analyze/model_anlayze.rs (1)
715-726: Preserve Column spans when relation doesn’t belong to MDL.In the non-MDL path, rewrite_column_qualifier currently rebuilds a Column with Spans::new(), which drops original spans while marking Transformed::no. Capture and pass through the original spans to avoid span loss.
Apply this minimal change to preserve spans without broad refactors:
- Expr::Column(Column { relation, name, .. }) => { - if let Some(relation) = relation { + Expr::Column(Column { relation, name, spans, .. }) => { + if let Some(relation) = relation { Ok(self.rewrite_column_qualifier(relation, name, alias_model)) } else { @@ - Ok(Transformed::yes(ident)) + Ok(Transformed::yes(ident)) } }and adjust rewrite_column_qualifier’s non-MDL branch to use the provided spans (pass it in or return Transformed::no with the original Column if available):
- } else { - Transformed::no(Expr::Column(Column { - relation: Some(relation), - name, - spans: Spans::new(), - })) - } + } else { + // Prefer to return the original Expr::Column with its spans (if available) + Transformed::no(Expr::Column(Column { + relation: Some(relation), + name, + spans, // preserve existing spans + })) + }If threading spans into rewrite_column_qualifier complicates the signature, alternatively decide the Transformed::no branch directly in map_column_and_rewrite_qualifier and return the original Expr::Column there.
wren-core/core/src/mdl/function.rs (2)
357-366: Window UDF field now returns FieldRef; derive return type from input_fields.This matches the new API and avoids reliance on input_types. One suggestion: consider nullability propagation (e.g., SameAsInput should often inherit input nullability).
Example tweak:
- Ok(Field::new(field_args.name(), return_type, false).into()) + let nullable = match self.return_type { + ReturnType::SameAsInput => field_args.input_fields().first().map(|f| f.is_nullable()).unwrap_or(true), + _ => false, + }; + Ok(Field::new(field_args.name(), return_type, nullable).into())
140-147: Support LargeList and FixedSizeList for SameAsInputFirstArrayElement.Currently only List is handled; add LargeList and FixedSizeList to avoid not_impl_err when callers pass those array types.
- ReturnType::SameAsInputFirstArrayElement => { + ReturnType::SameAsInputFirstArrayElement => { if arg_types.is_empty() { return not_impl_err!("No input type"); } - if let DataType::List(field) = &arg_types[0] { - field.data_type().clone() - } else { - return not_impl_err!("Input type is not array"); - } + match &arg_types[0] { + DataType::List(field) | + DataType::LargeList(field) => field.data_type().clone(), + DataType::FixedSizeList(field, _) => field.data_type().clone(), + _ => return not_impl_err!("Input type is not array"), + } }wren-core/core/src/mdl/dialect/wren_dialect.rs (1)
42-47: identifier_quote_style compiles Regex on every call; cache it.Minor perf nit: compile the regex once statically.
For example:
-use regex::Regex; +use regex::Regex; +use once_cell::sync::Lazy; +static IDENT_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^[a-zA-Z_][a-zA-Z0-9_]*$").unwrap()); @@ - let identifier_regex = Regex::new(r"^[a-zA-Z_][a-zA-Z0-9_]*$").unwrap(); - if ALL_KEYWORDS.contains(&identifier.to_uppercase().as_str()) - || !identifier_regex.is_match(identifier) + if ALL_KEYWORDS.contains(&identifier.to_uppercase().as_str()) + || !IDENT_RE.is_match(identifier) || non_lowercase(identifier)Ensure once_cell is available in the crate; otherwise use lazy_static.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
wren-core-py/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
wren-core-py/src/context.rs(2 hunks)wren-core/Cargo.toml(1 hunks)wren-core/core/src/logical_plan/analyze/model_anlayze.rs(5 hunks)wren-core/core/src/logical_plan/optimize/simplify_timestamp.rs(2 hunks)wren-core/core/src/logical_plan/utils.rs(2 hunks)wren-core/core/src/mdl/context.rs(2 hunks)wren-core/core/src/mdl/dialect/inner_dialect.rs(1 hunks)wren-core/core/src/mdl/dialect/utils.rs(2 hunks)wren-core/core/src/mdl/dialect/wren_dialect.rs(2 hunks)wren-core/core/src/mdl/function.rs(4 hunks)wren-core/core/src/mdl/mod.rs(21 hunks)wren-core/sqllogictest/test_files/view.slt(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wren-core/core/src/mdl/context.rs (2)
wren-core/core/src/mdl/mod.rs (1)
catalog(294-296)wren-core-base/manifest-macro/src/lib.rs (1)
view(347-366)
🪛 Gitleaks (8.27.2)
wren-core/core/src/mdl/mod.rs
2345-2345: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
2351-2351: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: test
- GitHub Check: ci
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (win64)
- GitHub Check: cargo check
- GitHub Check: cargo test (macos)
🔇 Additional comments (22)
wren-core/core/src/logical_plan/utils.rs (1)
298-303: AllExpr::Literalpatterns confirmed updated
A repository-wide scan of everyExpr::Literalusage shows all are now using two arguments ((_, _)), and there are no remaining single-argument patterns to update:
- core/src/logical_plan/utils.rs:
Expr::Literal(_, _)- core/src/logical_plan/optimize/simplify_timestamp.rs:
matches!(sub_expr.as_ref(), Expr::Literal(_, _))Expr::Literal(value, _)- core/src/mdl/dialect/inner_dialect.rs:
Expr::Literal(ScalarValue::Utf8(Some(s)), _)/Expr::Literal(ScalarValue::LargeUtf8(Some(s)), _)No further changes are required.
wren-core/core/src/mdl/dialect/utils.rs (1)
21-21: Import of ObjectNamePart is required and correctThe import aligns with the new AST form for ObjectName segments. Looks good.
wren-core/core/src/logical_plan/optimize/simplify_timestamp.rs (1)
135-141: Literal shape check updated correctlyThe guard
matches!(sub_expr.as_ref(), Expr::Literal(_, _))is the right fix for DF 49. Behavior unchanged. Nice.wren-core-py/src/context.rs (1)
33-33: Good: switched to ValueWithSpan to keep literal spans intact.Using ValueWithSpan in the import set aligns with the PR-wide span/metadata preservation. No concerns.
wren-core/Cargo.toml (2)
11-11: Repository URL fix looks good.
16-16: Bump async-trait to 0.1.89: compatible and non-breaking here.wren-core/core/src/logical_plan/analyze/model_anlayze.rs (4)
104-108: Robust Subquery matching with .. to accommodate new fields.Good adaptation for additional Subquery fields (e.g., spans). This prevents brittle matches.
446-447: Updated Join field null_equality: aligned with DataFusion v49.Field rename addressed correctly; functional behavior preserved.
596-611: Preserve Subquery spans on rewrite.Carrying spans through the Subquery transformation is consistent with the PR goal of span/metadata preservation.
731-740: Alias now preserves metadata: good.Retaining metadata in Alias during expression rewrites avoids losing documentation or hints attached upstream.
wren-core/core/src/mdl/function.rs (3)
1-1: Switch to FieldRef: correct for DataFusion v49 Arrow types.Importing FieldRef and using it where needed aligns with upstream type changes.
228-231: Updated ScalarUDFImpl to invoke_with_args: correct trait adaptation.Returning an internal_err ensures the bypass UDF is never executed at runtime, which is the intended design.
604-605: Test update for list element Field(name="item", nullable=true) is correct.Matches DF v49 semantics; good coverage.
wren-core/core/src/mdl/dialect/wren_dialect.rs (1)
21-21: Narrowed imports and reliance on ALL_KEYWORDS/Expr: cleaner surface.wren-core/core/src/mdl/context.rs (4)
17-17: Import path correctly updated for DataFusion 49.0.1The CatalogProvider import path change from
datafusion::catalog_commontodatafusion::catalogaligns with the upstream DataFusion 49.0.1 reorganization.
303-303: ViewTable constructor change handled correctlyThe switch from
ViewTable::try_newtoViewTable::newis appropriate since the new API no longer returns a Result. This aligns with DataFusion 49.0.1 API changes where view creation is now infallible.
169-187: Simplified analyzer rules for local runtimeThe removal of
ExpandWildcardRuleandInlineTableScananalyzer rules appears intentional based on the PR context. These rules are likely now handled differently in DataFusion 49.0.1 or are no longer necessary for the local runtime mode.
220-260: Ignore incorrect assertion aboutUnwrapCastInComparisonremovalA search of the current codebase shows no references to
UnwrapCastInComparison, so it was never part of the unparser pipeline and hasn’t been removed. The suggestion to verify its replacement isn’t applicable here.Likely an incorrect or invalid review comment.
wren-core/core/src/mdl/mod.rs (4)
695-695: Test assertions updated for new SQL generation behaviorThe test expectations now include explicit
CAST(column AS BIGINT)for numeric comparisons andRIGHT OUTER JOINinstead ofRIGHT JOIN. These changes reflect DataFusion's more explicit type casting in the generated SQL for better cross-database compatibility.Also applies to: 707-707, 1173-1173
1739-1739: Numeric predicates now include explicit type castingThe generated SQL now includes
CAST(column AS BIGINT)for integer comparisons (e.g.,CAST(customer.c_custkey AS BIGINT) = 1). This ensures type safety and consistency across different database backends.Also applies to: 1774-1774, 1823-1826, 1874-1877
2345-2345: False positive: Test data strings are not API keysThe static analysis tool incorrectly flagged the test strings
'Gura'andsession_useras potential API keys. These are clearly test data values used in row-level access control testing scenarios.Also applies to: 2351-2351
1173-1173: UNNEST test output includes new aliasing structureThe UNNEST query now generates a
derived_projectionalias with explicit column naming. This change improves SQL clarity and ensures proper column referencing in complex queries.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ibis-server/tests/conftest.py (1)
14-14: Hard-coded function count updated to 283 — consider deriving dynamically to reduce brittlenessThis constant tends to change with DataFusion bumps. If feasible, derive the baseline count at runtime (e.g., via an API call or a helper in tests) or centralize it in a single test utility to avoid manual updates in multiple places.
If you keep the constant, ensure all tests referencing DataFusion function counts have been updated consistently (search for “len(functions) ==” and “DATAFUSION_FUNCTION_COUNT” across the repo).
wren-core-py/tests/test_modeling_core.py (1)
139-139: Replace brittle exact counts with a dynamic calculation based on union of built-ins and CSV-defined functionsExact numbers will drift with DataFusion releases. Compute the expected length as the union of:
- base registry functions (no CSV)
- names in tests/functions.csv
This keeps the assertion stable while still verifying the merge logic.
Apply this diff within the current test to remove hard-coded counts:
@@ - functions = session_context.get_available_functions() - assert len(functions) == 290 + functions = session_context.get_available_functions() + # Dynamically derive expected count: DataFusion registry ∪ CSV custom functions + base_functions = SessionContext(manifest_str, None).get_available_functions() + base_names = {f.name for f in base_functions} + import csv # local import to keep test module scope clean + with open(path, newline="") as f: + csv_names = {row["name"] for row in csv.DictReader(f) if row.get("name")} + assert len(functions) == len(base_names | csv_names) @@ - functions = session_context.get_available_functions() - assert len(functions) == 283 + functions = session_context.get_available_functions() + assert len(functions) == len(base_names)If you prefer a dedicated helper, I can factor this into a small utility used by both this test and ibis-server so future bumps don’t require manual edits. Do you want me to open a follow-up PR for that?
Also applies to: 151-151
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
wren-core-py/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
ibis-server/tests/conftest.py(1 hunks)wren-core-py/Cargo.toml(1 hunks)wren-core-py/src/context.rs(2 hunks)wren-core-py/tests/test_modeling_core.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- wren-core-py/src/context.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: cargo test (amd64)
- GitHub Check: ci
- GitHub Check: cargo test (win64)
🔇 Additional comments (1)
wren-core-py/Cargo.toml (1)
22-22: Verify Tokio version & feature consistency inwren-coreworkspacePlease confirm the following based on the audit of all
tokiodeclarations:
- In
wren-core/Cargo.toml(the workspace root):
tokio = { version = "1.46", features = ["rt", "rt-multi-thread", "macros"] }- In
wren-core/core/Cargo.toml:
tokio = { workspace = true, features = ["rt", "rt-multi-thread", "macros"] }- In
wren-core/benchmarks/Cargo.tomlandwren-core/wren-example/Cargo.toml:
tokio = { workspace = true }(inherits features from the workspace root)- In
wren-core/sqllogictest/Cargo.toml:
- Under regular
[dependencies]:tokio = { workspace = true }(inherits all features)- Under
[[test]]deps:tokio = { workspace = true, features = ["rt-multi-thread"] }
– Note: this overrides the workspace’s full feature set and omits"macros". If your tests use#[tokio::test]or any Tokio macros, you’ll need to add"macros"here.Separately,
wren-core-py/Cargo.tomlis not part of that workspace—it pinstokio = "1.46.0"with default features. If you intend for the Python-binding crate to share exactly the same version and feature flags, you can either add it to the workspace and switch toworkspace = trueor continue managing its dependency explicitly.Optional refactoring recommendation:
- Centralize the Tokio dependency under a
[workspace.dependencies]section in the workspace-rootwren-core/Cargo.toml(e.g.tokio = { version = "1.46.0", features = ["rt", "rt-multi-thread", "macros"] }). This will ensure all member crates (including tests) stay in sync without per-crate repetition.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ibis-server/tests/routers/v3/connector/local_file/test_functions.py (2)
56-58: Fix stale inline comment to match the new expectation (429 → 427).The inline comment still says 429 while the assertion expects 427. This will confuse future readers.
Apply this diff:
- # 429 is the number of functions in `resources/function_list/duckdb.csv` file excluded the default functions in DataFusion. - assert len(result) == DATAFUSION_FUNCTION_COUNT + 427 + # 427 is the number of functions in resources/function_list/duckdb.csv, excluding the default DataFusion functions. + assert len(result) == DATAFUSION_FUNCTION_COUNT + 427
52-58: Make the size assertion data-driven to avoid churn on every DataFusion bump.Right now the test hard-codes “427”. You can compute the delta by comparing the CSV function names against the base function names returned by the API (with remote list unset). This keeps the test stable when DataFusion adds/removes built-ins or when the CSV changes.
Proposed refactor within this block:
- config.set_remote_function_list_path(function_list_path) - response = await client.get(url=f"{base_url}/functions") - assert response.status_code == 200 - result = response.json() - # 429 is the number of functions in `resources/function_list/duckdb.csv` file excluded the default functions in DataFusion. - assert len(result) == DATAFUSION_FUNCTION_COUNT + 427 + # Capture the base set of function names from DataFusion alone + base_names = {f["name"] for f in result} + base_len = len(base_names) + + # Enable remote function list and fetch again + config.set_remote_function_list_path(function_list_path) + response = await client.get(url=f"{base_url}/functions") + assert response.status_code == 200 + result = response.json() + + # Compute expected delta from the CSV by excluding names already present in DataFusion + csv_names = _read_duckdb_function_names(function_list_path) + expected_delta = len(csv_names - base_names) + assert len(result) == base_len + expected_deltaAdd this small helper (outside this function, anywhere in the module), and update imports accordingly:
import csv import os def _read_duckdb_function_names(function_list_dir: str) -> set[str]: csv_path = os.path.join(function_list_dir, "duckdb.csv") with open(csv_path, newline="", encoding="utf-8") as f: return {row["name"] for row in csv.DictReader(f)}This mirrors the server’s de-duplication by name and removes the magic number.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
ibis-server/tests/routers/v3/connector/local_file/test_functions.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: cargo test (macos)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (win64)
- GitHub Check: cargo check
- GitHub Check: test
- GitHub Check: ci
|
Thanks @goldmedal |
Description
Summary by CodeRabbit